5 research outputs found

    Towards information profiling: data lake content metadata management

    There is currently a burst of Big Data (BD) being processed and stored in huge raw data repositories, commonly called Data Lakes (DLs). These data require new techniques of data integration and schema alignment in order to make them usable by their consumers and to discover the relationships linking their content. This can be provided by metadata services which discover and describe the lake's content. However, there is currently no systematic approach for this kind of metadata discovery and management. Thus, we propose a framework for profiling the informational content stored in the DL, which we call information profiling. The profiles are stored as metadata to support data analysis. We formally define a metadata management process which identifies the key activities required to handle this effectively. We demonstrate the alternative techniques and performance of our process using a prototype implementation handling a real-life case study from the OpenML DL, which showcases the value and feasibility of our approach. Peer reviewed. Postprint (author's final draft).
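
    The profiling idea lends itself to a short illustration. Below is a minimal sketch, assuming pandas is available, of extracting per-attribute content metadata from a raw dataset and storing it as a profile; the function and field names are illustrative, not the authors' actual implementation.

        import pandas as pd

        def profile_dataset(df: pd.DataFrame) -> dict:
            """Build a content-metadata profile for one dataset."""
            profile = {"n_rows": len(df), "n_attributes": df.shape[1], "attributes": {}}
            for col in df.columns:
                s = df[col]
                profile["attributes"][col] = {
                    "dtype": str(s.dtype),  # inferred type (schema-on-read)
                    "distinct_ratio": s.nunique() / max(len(s), 1),
                    "null_ratio": float(s.isna().mean()),
                }
            return profile

        # Profiles like these can be stored as metadata for every dataset in
        # the lake and later compared to discover content relationships.
        df = pd.DataFrame({"age": [23, 35, None], "city": ["BCN", "BXL", "BCN"]})
        print(profile_dataset(df))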

    Keeping the data lake in form: DS-kNN datasets categorization using proximity mining

    With the growth in the number of datasets stored in data repositories, there has been a trend towards using Data Lakes (DLs) to store such data. DLs store datasets in their raw formats, without any transformations or preprocessing, and make them accessible via schema-on-read. This makes it difficult for analysts to find datasets that can be crossed and that belong to the same topic. To support them in this DL governance challenge, we propose in this paper an algorithm for categorizing datasets in the DL into pre-defined topic-wise categories of interest. We utilise a k-NN approach for this task, which uses a proximity score for computing similarities of datasets based on their metadata. We test our algorithm on a real-life DL with a known ground-truth categorization. Our approach is successful in detecting the correct categories for datasets and outliers, with a precision of more than 90% and recall rates exceeding 75% in specific settings. Peer reviewed. Postprint (author's final draft).
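
    The categorization step can be sketched compactly. The following is a hedged illustration of the DS-kNN idea, assuming some proximity function over dataset profiles is given: a new dataset takes the majority category among its k most proximate labelled datasets, and is flagged as an outlier when even its best proximity is low. The threshold value and function names are assumptions, not the published settings.

        from collections import Counter

        def ds_knn(new_profile, labelled, proximity, k=5, outlier_threshold=0.3):
            """Categorize a dataset profile by majority vote of its k nearest peers."""
            # labelled: list of (profile, category) pairs with ground-truth categories.
            scored = sorted(((proximity(new_profile, p), cat) for p, cat in labelled),
                            reverse=True)[:k]
            if not scored or scored[0][0] < outlier_threshold:
                return "outlier"  # nothing in the lake is similar enough
            votes = Counter(cat for _, cat in scored)
            return votes.most_common(1)[0][0]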

    DS-Prox: dataset proximity mining for governing the data lake

    With the arrival of Data Lakes (DLs), there is an increasing need for efficient dataset classification to support data analysis and information retrieval. Our goal is to use meta-features describing datasets to detect whether they are similar. We utilise a novel proximity mining approach to assess the similarity of datasets. The proximity scores are used as an efficient first step, where pairs of datasets with high proximity are selected for further time-consuming schema matching and deduplication. The proposed approach helps in early-pruning unnecessary computations, thus improving the efficiency of similar-schema search. We evaluate our approach in experiments using the OpenML online DL, showing efficiency gains above 25% compared to matching without early-pruning, and recall rates higher than 90% under certain scenarios. Peer reviewed. Postprint (author's final draft).
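
    A minimal sketch of this early-pruning first step follows, assuming each dataset is already summarised by a numeric meta-feature vector. The cosine proximity and the 0.8 threshold are illustrative choices, not the paper's published settings.

        import math
        from itertools import combinations

        def proximity(a: list[float], b: list[float]) -> float:
            """Cosine similarity between two meta-feature vectors."""
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / norm if norm else 0.0

        def candidate_pairs(meta_features: dict[str, list[float]], threshold=0.8):
            """Yield dataset pairs proximate enough to deserve full schema matching."""
            for d1, d2 in combinations(meta_features, 2):
                if proximity(meta_features[d1], meta_features[d2]) >= threshold:
                    yield d1, d2  # only these pairs reach the costly matcher

    Because the meta-feature comparison is cheap relative to instance-based matching, pruning low-proximity pairs up front is where the reported efficiency gains come from.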

    Keeping the data lake in form: proximity mining for pre-filtering schema matching

    Data Lakes (DLs) are large repositories of raw datasets from disparate sources. As more datasets are ingested into a DL, there is an increasing need for efficient techniques to profile them and to detect the relationships among their schemata, a task commonly known as holistic schema matching. Schema matching detects similarity between the information stored in the datasets to support information discovery and retrieval. Currently, this is computationally expensive at the volume of state-of-the-art DLs. To handle this challenge, we propose a novel early-pruning approach to improve efficiency: we collect different types of content metadata and schema metadata about the datasets, and then use this metadata in early-pruning steps to pre-filter the schema matching comparisons. This involves computing proximities between datasets based on their metadata, discovering their relationships based on overall proximities, and proposing similar dataset pairs for schema matching. We improve the effectiveness of this task by introducing a supervised mining approach for detecting similar datasets, which are then proposed for further schema matching. We conduct extensive experiments on a real-world DL which prove the success of our approach in detecting similar datasets for schema matching, with recall rates of more than 85% and efficiency improvements above 70%. We empirically show the computational cost savings in space and time achieved by applying our approach in comparison to instance-based schema matching techniques. This research was partially funded by the European Commission through the Erasmus Mundus Joint Doctorate (IT4BI-DC). Peer reviewed. Postprint (author's final draft).
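
    The supervised pre-filtering step can be illustrated as a binary classifier over per-pair proximity features that decides whether a dataset pair is worth full schema matching. The feature names, the toy training data and the choice of a random forest below are assumptions for illustration, not the authors' actual model.

        from sklearn.ensemble import RandomForestClassifier

        # Each row holds proximities computed from a pair's content/schema
        # metadata, e.g. [name_similarity, attribute_overlap, value_proximity].
        X_train = [[0.9, 0.8, 0.7], [0.1, 0.2, 0.1], [0.7, 0.9, 0.6], [0.2, 0.1, 0.3]]
        y_train = [1, 0, 1, 0]  # 1 = the pair is a known match (ground truth)

        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(X_train, y_train)

        def propose_for_matching(pair_features):
            """Return True if the pair should be forwarded to schema matching."""
            return bool(clf.predict([pair_features])[0])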

    Dataset proximity mining for supporting schema matching and data lake governance

    Thesis under joint supervision (cotutelle) between Universitat Politècnica de Catalunya and Université Libre de Bruxelles. With the huge growth in the amount of data generated by information systems, it is common practice today to store datasets in their raw formats (i.e., without any data preprocessing or transformations) in large-scale data repositories called Data Lakes (DLs). Such repositories store datasets from heterogeneous subject areas (covering many business topics) and with many different schemata. Therefore, it is a challenge for data scientists using the DL for data analysis to find datasets relevant to their analysis tasks without any support or data governance. The goal is to extract metadata and information about the datasets stored in the DL to support the data scientist in finding relevant sources. This shapes the main goal of this thesis, where we explore different techniques of data profiling, holistic schema matching and analysis recommendation to support the data scientist. We propose a novel framework based on supervised machine learning to automatically extract metadata describing datasets, including the computation of their similarities and data overlaps using holistic schema matching techniques. We use the extracted relationships between datasets to categorize them automatically, supporting the data scientist in finding relevant datasets with intersections between their data. This is done via a novel metadata-driven technique called proximity mining, which consumes the extracted metadata via automated data mining algorithms in order to detect related datasets and to propose relevant categories for them. We focus on flat (tabular) datasets organised as rows of data instances and columns of attributes describing the instances. Our proposed framework uses the following four main techniques: (1) instance-based schema matching for detecting relevant data items between heterogeneous datasets, (2) dataset-level metadata extraction and proximity mining for detecting related datasets, (3) attribute-level metadata extraction and proximity mining for detecting related datasets, and finally, (4) automatic dataset categorization via supervised k-Nearest-Neighbour (kNN) techniques. We implement our proposed algorithms in a prototype that shows the feasibility of this framework. We apply the prototype in an experiment on a real-world DL scenario to prove the feasibility, effectiveness and efficiency of our approach, achieving high recall rates and efficiency gains while improving computational space and time consumption by two orders of magnitude, via our proposed early-pruning and pre-filtering techniques, in comparison to classical instance-based schema matching techniques. This proves the effectiveness of our proposed automatic methods in the early-pruning and pre-filtering tasks for holistic schema matching and in automatic dataset categorisation, while also demonstrating improvements over human-based data analysis for the same tasks. Postprint (published version).
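
    As an overview, the four techniques above compose into a single governance pipeline. The sketch below shows how early pruning confines instance-based matching to high-proximity pairs before kNN categorization; every function passed in is a placeholder for a stage named in the abstract, not an actual API of the thesis prototype.

        def govern_data_lake(datasets, labelled, extract_metadata, proximity,
                             schema_match, ds_knn, threshold=0.8):
            """Placeholder pipeline: profile, prune, match, then categorize."""
            # (2) + (3): extract dataset- and attribute-level metadata profiles.
            profiles = {name: extract_metadata(df) for name, df in datasets.items()}

            # Early pruning: only high-proximity pairs reach the costly
            # (1) instance-based schema matching step.
            names = list(profiles)
            matches = []
            for i, a in enumerate(names):
                for b in names[i + 1:]:
                    if proximity(profiles[a], profiles[b]) >= threshold:
                        matches.append((a, b, schema_match(datasets[a], datasets[b])))

            # (4): categorize each dataset by its k nearest labelled neighbours.
            categories = {n: ds_knn(profiles[n], labelled, proximity) for n in names}
            return matches, categories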
